Finding Biologically Accurate Clusterings in Hierarchical Decompositions Using the Variation of Information
Abstract
Hierarchical clustering is a popular method for grouping together similar items based on a distance measure between them. These clusters can be used to infer annotations for uncharacterized items. However, in many cases, annotation information for some elements is known beforehand. We present a novel approach for decomposing a hierarchical clustering into the optimal clusters that match a set of known annotations, as measured by the variation of information metric. Our approach is general, and we apply it to two biological domains: finding protein complexes within protein interaction networks and identifying species within metagenomic DNA samples. For both applications, we test the quality of our clusters by using them to predict complex and species membership. We find that our approach generally outperforms the commonly used heuristic methods.
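The variation of information used above as the matching criterion can be illustrated with a minimal sketch. It assumes two hard clusterings given as equal-length label sequences over the same items; the function name `variation_of_information` is hypothetical, and the code only evaluates the metric. It does not reproduce the paper's procedure for choosing clusters from the hierarchy.

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """Variation of information (in nats) between two hard clusterings.

    Assumption: labels_a[i] and labels_b[i] are the cluster labels of the
    same item i under the two clusterings being compared.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    n = len(labels_a)
    if n == 0:
        return 0.0

    size_a = Counter(labels_a)               # cluster sizes in clustering A
    size_b = Counter(labels_b)               # cluster sizes in clustering B
    joint = Counter(zip(labels_a, labels_b))  # sizes of pairwise overlaps

    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        # VI = H(A|B) + H(B|A)
        #    = -sum p(a,b) * [log(p(a,b)/p(a)) + log(p(a,b)/p(b))]
        vi -= p_ab * (log(n_ab / size_a[a]) + log(n_ab / size_b[b]))
    return vi
```

A value of zero means the two clusterings are identical up to relabeling, and larger values mean the clusterings disagree more. In a setting like the one described in the abstract, one clustering would come from cutting the hierarchy and the other from the known annotations, restricted to the annotated items.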
Similar resources
Finding Biologically Accurate Clusterings in Hierarchical Tree Decompositions Using the Variation of Information
Hierarchical clustering is a popular method for grouping together similar elements based on a distance measure between them. In many cases, annotations for some elements are known beforehand, which can aid the clustering process. We present a novel approach for decomposing a hierarchical clustering into the clusters that optimally match a set of known annotations, as measured by the variation o...
Selection of Ensemble Members in Ensemble Clustering Using Voting
Clustering is the process of dividing a dataset into subsets, called clusters, so that objects within a cluster are similar to each other and different from objects in other clusters. Many algorithms based on different approaches have been developed for clustering. An effective option is to combine two or more of these algorithms to solve the clustering problem. Ensemb...
Temporal Hierarchical Clustering
We study hierarchical clusterings of metric spaces that change over time. This is a natural geometric primitive for the analysis of dynamic data sets. Specifically, we introduce and study the problem of finding a temporally coherent sequence of hierarchical clusterings from a sequence of unlabeled point sets. We encode the clustering objective by embedding each point set into an ultrametric spa...
Analysis and Optimization of Graph Decompositions by Lifted Multicuts
We study the set of all decompositions (clusterings) of a graph through its characterization as a set of lifted multicuts. This leads us to practically relevant insights related to the definition of classes of decompositions by must-join and must-cut constraints and related to the comparison of clusterings by metrics. To find optimal decompositions defined by minimum cost lifted multicuts, we e...
Comparing Clusterings by the Variation of Information
This paper proposes an information theoretic criterion for comparing two partitions, or clusterings, of the same data set. The criterion, called variation of information (VI), measures the amount of information lost and gained in changing from clustering C to clustering C′. The criterion makes no assumptions about how the clusterings were generated and applies to both soft and hard clusterings....
Publication date: 2008